Study guide

PSY721 – Tests and Measurements

Author

Nils Myszkowski

Published

December 2, 2025

Warning

This covers aspects of the class considered important and not self-evident, but it should not be considered the only document to review.

Note

This document is updated regularly. Check back soon!

1 What to review

For multiple choice exams (including the Qualifying Exams and Comprehensive Exams), you do not need to know the following:

  • How to execute analyses in SPSS or another software
  • How to write in mathematical form the different CTT models (however the conceptual differences between them are expected to be understood)

2 Introduction

2.1 Psychometric terminology

  • The psychological attribute being measured is referred to as the construct.

  • To increase accuracy, dimensions are measured through responses on multiple indicators, called items.

  • When persons are administered a test, it yields item responses.

  • If the measurement tool is intended to measure multiple facets of the construct, it may be called an inventory, and the measure of one of the facets is often called a subscale.

  • The properties of psychological tests are investigated through the discipline of psychometrics, and these properties are generally referred to as psychometric (or metrological) qualities.

2.2 Fundamental statistics for this class

2.2.1 Levels of measurement

  • Nominal/categorical
    • Special case: Dichotomous/binary (2 categories)
    • e.g., multiple choice
  • Ordinal (response categories that can be ordered)
    • e.g., Likert scales
  • Interval (the interval between responses is useful information)
    • e.g., an extroversion scale’s sum scores
  • Ratio (the value \(0\) is meaningful, in that it implies the absence of the quantity of interest)
    • e.g., response time in seconds

2.2.2 Manifest and latent variables

A manifest variable is a variable that can be directly observed. In psychometrics, item responses (e.g., a person’s answer “yes” to “I often feel sad”) are considered to be manifest variables.

A latent variable is a variable that cannot be directly observed. In psychometrics, psychological attributes (e.g., a person’s depression) are considered to be latent variables.

Latent variables that are assumed to be continuous (e.g., depression levels on a continuum) are called latent trait variables, while latent variables that are assumed to be discrete (e.g., types of depression presentation) are referred to as latent class variables. Most psychometric models assume that latent variables are traits.

2.2.3 Types of statistics

  • Univariate vs. bivariate vs. multivariate
    • Univariate statistics study one variable at a time (e.g., the mean score of item 1).
    • Bivariate statistics study the relation between 2 variables (e.g., the correlation between item 1 and item 2).
    • Multivariate statistics study the overall relations between more than 2 variables (e.g., a factor analysis of a 10-item instrument).
  • Descriptive vs. inferential
    • Descriptive statistics are used to describe phenomena observed in the sample.
    • Inferential statistics are used to draw conclusions about the population.

2.2.4 Note on correlations

Correlations are largely used in psychometrics, especially to study reliability and validity.

A correlation coefficient \(r\) is a statistic that represents the strength and direction of the relation between 2 numeric variables (e.g., 2 item scores).

Correlations range between \(-1\) (perfect negative) and \(+1\) (perfect positive):

  • Strength

    • Negligible/Null: \(|r| < .10\)
    • Weak/Small: \(.10 < |r| < .30\)
    • Moderate: \(.30 < |r| < .50\)
    • Strong/Large: \(.50 < |r|\)
    • The stronger the correlation, the more aligned the points.
  • Direction

    • Positive : \(r>0\)

      • When one variable increases, the other tends to increase
    • Negative : \(r<0\)

      • When one variable increases, the other tends to decrease
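As an illustration, the Pearson correlation can be computed directly from its definition. This is a minimal sketch with hypothetical item scores; `pearson_r` is a helper written for this example, not a library function.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical scores of 6 respondents on two items
item1 = [1, 2, 3, 4, 5, 5]
item2 = [2, 2, 3, 5, 4, 5]
r = pearson_r(item1, item2)  # strong positive correlation (between .50 and 1)
```

In practice, this would be obtained from statistical software; the point is that \(r\) summarizes both the direction (its sign) and the strength (its magnitude) of the relation.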

3 Test construction process

4 Item construction

4.1 Response formats

Typical response formats:

  • Likert scales (ordinal response)
  • Multiple-choice questions (usually transformed to binary)
  • True/false (binary response)
  • Open-ended questions (usually scored by raters, which turns responses into binary or ordinal scales)
  • Forced-choice (categorical response)
  • Time to solve a problem (continuous response)
  • Visual analog scales (continuous response)

To use average or sum scoring, all the items should have the same response format.

4.2 How to build items

Items are in general built so that item responses are manifestations of the construct (i.e., are caused by the construct, see causal theory of measurement).

For example, it is because a person is extroverted (construct) that they will answer “yes” (item response) to “Do you like going out with friends?” (item).

Two main approaches (which can be mixed) to item construction:

  • Deductive approach: Items are designed from theory/previous research.
  • Inductive approach: Items are designed from asking a sample of subject matter experts various questions about how the construct manifests (e.g., to measure clinical depression, we could ask a panel of clinical psychologists how they detect depressive symptoms in their clients).

In general, at this stage we want to maximize content validity, primarily (see section on validity and content validity).

4.3 Response biases

Common biases in psychological testing:

  • Social desirability bias: Tendency of respondents to respond in a way that presents themselves favorably.

    • Common problem of self-report measures
    • Comprised of self-deception (biased opinion of oneself) and other-deception (“faking”)
    • Context specific (e.g., higher in recruitment contexts) and construct specific (e.g., some attributes are more desirable than others).
    • Solutions notably include using neutral items, using forced-choice response formats, or social desirability (i.e. lie) scales
  • Comprehension biases (complexity, double-barreled items, ambiguity, etc.)

  • Response format biases

    • Central tendency vs. extreme response bias (tendency to prefer central or extreme responses)
    • Acquiescence bias (solutions include notably using negatively worded items)
  • Emotional biases (negative or positive emotions can induce bias)

  • Applicability bias (items need to be equally applicable and appropriate to any individual of the target population)

  • Cultural biases (may lead to discriminatory practices)

  • Stereotype bias (when an individual is put in a testing situation that may involve negative stereotypes about their performance on similar tests, they tend to underperform).

  • Non-response bias (when the individuals who do not respond to a specific item or test have different characteristics than individuals who do respond).

4.4 Scoring biases

Biases may be induced by the use of raters:

  • Order biases (bias induced by the order of rating responses, e.g., contrast bias)
  • Consistency bias (raters tend to seek a consistent view of respondents)
  • Expectancy bias (raters tend to use expectations about a respondent’s performance in their judgment)

5 Test theory

5.1 Causal theory of measurement

The Causal Theory of Measurement (CTM) is a core underlying assumption of Classical Test Theory (as well as Modern/Item-Response Theory).

According to this theory, observed item responses are caused by unobserved (i.e., latent) attributes (called the true score in Classical Test Theory, called a latent variable in Modern/Item-Response Theory).

5.2 Classical test theory (CTT)

5.2.1 General assumptions

CTT assumes that (observed) item scores \(X\) are comprised of a true score \(T\) (an error-free score, or expected score) and a random error \(E\):

\[ X = T + E \]

The error \(E\) is assumed to be a random draw from a Gaussian (Normal) distribution, with mean \(0\) and error variance \(\sigma^2\) (the error variance is fixed for a given item, but for certain models it is also fixed across items). An implication of this assumption is that item scores are distributed according to Gaussian (Normal) distributions.

\[ E \sim \text{Normal}(0, \sigma^2) \]

The true score \(T\) corresponds to what a person’s score would be without any measurement error. In general, we approximate true scores in CTT by averaging many observed scores (e.g., averaging scores across items). Since scales and locations of psychological traits are arbitrary, sum scores may also be used.

To note, the equation presented implies a linear relation between \(T\) and \(X\) : The item response changes linearly with the true score. Modern (Item-Response) Theory allows non-linear relations, as well as non-Gaussian items.
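The CTT equation above can be illustrated with a small simulation. This is a sketch under assumed values (a hypothetical person with true score \(T = 10\) and error standard deviation \(2\)): averaging many observed scores recovers the true score, because the random errors average out to \(0\).

```python
import random

random.seed(42)  # for reproducibility

def simulate_observed_scores(true_score, error_sd, n_items):
    """CTT: observed scores X = T + E, with E ~ Normal(0, error_sd**2)."""
    return [true_score + random.gauss(0, error_sd) for _ in range(n_items)]

# Hypothetical person: T = 10, error SD = 2, measured on 1000 parallel items
scores = simulate_observed_scores(10, 2, 1000)
mean_score = sum(scores) / len(scores)  # approaches the true score T = 10
```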

5.2.2 Models

Note

It is not necessary to review this section for the Comprehensive Examination.

There are several more or less constrained variants of CTT (Lord et al., 1968). What changes between them is whether the error variance and the true score change by item. Throughout, the person is indexed by \(i\) and the item by \(j\).

  • Parallel CTT (most constrained)

    • Items are completely interchangeable (nothing differs by item).
    • Average scores can be used as estimators of the true score.
    • Reliability can be predicted by the number of items (see Spearman-Brown reliability prophecy formula).

  • Tau-equivalent CTT

    • Items differ in their error variance \(\sigma_j^2\) : \(E \sim \text{Normal}(0, \sigma^2_j)\). The true score remains the same across items.
    • Average scores can be used as estimators of the true score.
    • Reliability can no longer be predicted from the number of items.

  • Essentially tau-equivalent CTT

    • Items differ in their error variance and true score. The true score is given as a person true score and an item location parameter: \(T_{ij} = T_i + b_j\).
    • Compared with the previous model, two sets of items have different expected scores (i.e., a set of items may be more difficult than another).
    • Average scores are not estimators of the true score, but, given a set of items, they are perfectly correlated to true scores (so average scores can be used as proxies of true scores to compare relative individual locations).

  • Congeneric CTT (least constrained)

    • Items differ in their error variance and true score. The true score is given as a linear function of the person true score, item location parameter, and item slope/discrimination/loading parameter \(T_{ij} = a_jT_i + b_j\).
    • Average scores are no longer perfectly correlated with the true score. Other methods of scoring (i.e., factor scoring) are advised (McNeish & Wolf, 2020).

5.3 Other approaches

5.3.1 Item response theory

Item-response theory (IRT, i.e., modern test theory) is the most common alternative framework to CTT.

Like CTT, IRT describes item responses as a mathematical function of person attributes and item parameters (Ayala, 2013). Unlike CTT, IRT allows this function to be non-linear and the item distribution to not be assumed Normal.

IRT may be seen as a generalization of CTT models (Mellenbergh, 1994). It can be broadly defined as “a psychological measurement framework that uses statistical modeling to explain observed item responses using unobserved characteristics of persons and items” (Myszkowski, 2024, p. 25).

Typical examples of IRT models are logistic models (see below), which are notably used for binary item responses (e.g., pass-fail) and ordinal item responses (e.g., Likert scales). For these, the probability of responses (i.e., the expected item score) is given as a function of the latent attribute (traditionally noted \(\theta\) or \(\theta_i\) in IRT).
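As a sketch, the two-parameter logistic (2PL) model, one of the logistic IRT models for binary responses, gives the probability of a correct response as \(P(X = 1 \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}\), where \(a\) is the item discrimination and \(b\) the item difficulty. The parameter values below are illustrative.

```python
from math import exp

def p_correct(theta, a=1.0, b=0.0):
    """2PL item response function: probability of a correct response,
    given latent attribute theta, discrimination a, and difficulty b."""
    return 1 / (1 + exp(-a * (theta - b)))

# The relation between theta and the expected score is non-linear (S-shaped);
# at theta == b, the probability of a correct response is exactly .50
p_low, p_mid, p_high = p_correct(-2), p_correct(0), p_correct(2)
```

This illustrates the contrast with CTT: the expected item score is a non-linear (here, logistic) function of the latent attribute, and the item response is not assumed Normal.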

5.3.2 Behavior domain theory

According to behavior domain theory, items are not caused by psychological attributes, but they form them. Behavior domain theory uses models that are called formative models.

5.3.3 Network psychometric models

In network psychometric models, there are no latent variables causing item scores, and the relations between items are directly investigated.

6 Item analysis

6.1 Item variability

In testing, we measure attributes that are supposed to vary across individuals (e.g., extroversion). Thus, tests (and their items) are expected to yield different scores for different individuals. The extent to which a test (or an item) does so is called discriminating power.

In general, discriminating power is studied by using univariate statistics (i.e., studying the distribution) of the items and the scores. These typically include:

  • Location (i.e., central tendency) indices
    • Binary items: P-value (proportion of correct responses)
    • Interval/ordinal items: Mean, Median, Mode
    • Better variability is observed when location indices are not too extreme (i.e., not too close to the lower or upper bounds)
  • Dispersion indices
    • Range
    • Standard deviation/variance
    • Better variability is observed when dispersion indices are large.
  • Graphical methods
    • Frequency histograms/Frequency (bar) plots

6.2 Item discrimination

Note

It is not necessary to review this section for the Comprehensive Examination.

Item discrimination is the extent to which an item is able to differentiate between individuals with different levels of the construct.

In general, item discrimination is studied by using correlations between item scores and total scores (or subscale scores), which are called item-total correlations:

  • Binary items: Point-biserial correlation \(r_{pb}\)
  • Ordinal/interval items: Pearson correlation \(r\)

Instead of the total score, it is common (and advisable) to use the rest score, which is the score of the rest of the items (i.e., the score without the item studied), to compute the correlations, which in this case are referred to as corrected item-total correlations.

Values of item-total correlations are generally expected to be positive and above \(.3\).
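A corrected item-total correlation can be sketched as follows. The 5-point responses are hypothetical, and `pearson_r` and `corrected_item_total` are helpers written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def corrected_item_total(scores, item_index):
    """Correlation between one item and the rest score (total minus that item)."""
    item = [row[item_index] for row in scores]
    rest = [sum(row) - row[item_index] for row in scores]
    return pearson_r(item, rest)

# Hypothetical 1-5 responses of 6 respondents on 4 items
scores = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
    [5, 5, 4, 5],
]
r_item1 = corrected_item_total(scores, 0)  # expected to be positive and above .3
```

Using the rest score (rather than the total score) avoids inflating the correlation, since the item would otherwise be correlated with a total that contains itself.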

6.3 Distractor analysis

Note

It is not necessary to review this section for the Comprehensive Examination.

Distractor analysis is used for multiple-choice items. We typically evaluate the effectiveness of the incorrect options in attracting responses from lower-performing test-takers.

We do so by analyzing the proportion of test-takers who chose each distractor at different levels of the total score (or rest score). In general, we want all distractors to be chosen by some test-takers, but not by high performers.

7 Reliability

A measure has good reliability if it attributes consistent scores to a given person. In other words, reliability is the degree to which a measure is free from random error.

7.1 Notation

Reliability is frequently noted as \(\rho_{xx'}\).

However, reliability is generally unknown and estimated from a dataset. Estimators of reliability are often written as \(r_{xx'}\).

7.2 Conceptual definition

Reliability is viewed as the ratio of the true score variance \(\sigma^2_T\) over the observed score variance \(\sigma^2_X\):

\[ \rho_{xx'} = \frac{\sigma^2_T}{\sigma^2_X} \]

In practice, it cannot be calculated directly, as \(\sigma^2_T\) is unknown. We use instead estimators of reliability. In general, they are based on correlations between repeated measurements.
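A simulation can illustrate why this ratio captures reliability: if we could observe true scores (which we cannot in practice), the ratio of their variance to the observed-score variance would match \(\sigma^2_T / (\sigma^2_T + \sigma^2_E)\). The variances below are hypothetical.

```python
import random

random.seed(1)

true_var, error_var = 4.0, 1.0                            # hypothetical variances
expected_reliability = true_var / (true_var + error_var)  # .80 by definition

# Simulate true scores, then add random measurement error to get observed scores
true_scores = [random.gauss(0, true_var ** 0.5) for _ in range(20000)]
observed = [t + random.gauss(0, error_var ** 0.5) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

estimated_reliability = variance(true_scores) / variance(observed)  # close to .80
```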

7.3 Relation to standard error of measurement

While reliability is generally presented as a feature of a test in general, the accuracy of a person’s score is generally presented through the standard error of measurement (or through confidence intervals). The standard error of measurement is directly related to reliability: The higher the reliability, the smaller the standard error of measurement (and the narrower the confidence intervals).

7.4 Test-retest reliability

In test-retest reliability (i.e., time stability), we verify that a person obtains scores that are consistent when they are measured several times.

The typical design to study it consists in having the same sample of respondents take the instrument twice (or more, but usually twice), with a time interval (usually 1 to 4 weeks).

Good test-retest reliability is achieved if a (very) strong correlation (generally \(r > .70\)) is obtained between the time points.

7.5 Parallel forms reliability

Some tests exist in multiple similar (“parallel”) forms (in general, to avoid test contamination and biases induced from training).

Parallel forms reliability is studied by having the same sample of respondents take the two (or more) parallel forms.

Good parallel forms reliability is achieved if a (very) strong correlation (generally \(r > .70\)) is obtained between the scores of the forms.

7.6 Inter-rater reliability

Some tests require the intervention of raters, which may induce some random error.

Inter-rater reliability is studied by having the same sample of responses judged by a group of raters (2 or more).

Good inter-rater reliability is achieved if a (very) strong correlation (generally \(r > .70\)) is obtained between the scores of the raters.

7.7 Internal reliability

Within a test, each item may be considered an indicator of the construct in and of itself. Thus, there is some random error attached to each item.

Internal reliability is studied by administering the test in a sample of respondents. Internal reliability indices capture, using different approaches, the extent to which items produce consistent scores.

The most used measure of internal reliability is Cronbach’s \(\alpha\) (Cronbach, 1951). In general, values above \(.70\) are considered sufficient (the maximum is \(1\)).

In general, \(\alpha\) tends to increase with the number of items, and to decrease if the instrument is not unidimensional. It assumes the essentially tau-equivalent model. A common alternative to it is McDonald’s omega (\(\omega\)) (McDonald, 1999) which can be computed from different models (e.g., congeneric).

Internal reliability indices are often used as decision rules to discard items. For example, the “alpha if deleted” method consists in discarding items without which Cronbach’s \(\alpha\) would be higher.
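As a sketch, Cronbach’s \(\alpha\) can be computed from its usual formula, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_j \sigma^2_j}{\sigma^2_{total}}\right)\), where \(k\) is the number of items, \(\sigma^2_j\) the variance of item \(j\), and \(\sigma^2_{total}\) the variance of the total scores. The data below are hypothetical.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha from a respondents-by-items score matrix."""
    k = len(scores[0])  # number of items

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 responses of 6 respondents on 4 items
scores = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
    [5, 5, 4, 5],
]
alpha = cronbach_alpha(scores)  # above the conventional .70 threshold here
```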

7.8 Predicting reliability

Once we have an estimate for reliability with a given number of items, assuming that all items are completely interchangeable (i.e., parallel CTT model), we can use the Spearman-Brown formula (Brown, 1910; Spearman, 1910) to compute reliability for any number of items.

Note: Assuming all items being interchangeable, reliability increases with the number of items.
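The Spearman-Brown formula predicts the reliability \(\rho_{new}\) of a test whose length is multiplied by a factor \(k\): \(\rho_{new} = \frac{k\rho}{1 + (k - 1)\rho}\). A minimal sketch (the starting reliability is hypothetical):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor,
    assuming parallel (fully interchangeable) items."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)

# Doubling a hypothetical test with reliability .70
predicted = spearman_brown(0.70, 2)  # 1.4 / 1.7, approximately .82
```

Consistent with the note above, predicted reliability increases with the number of items (and halving a test, \(k = 0.5\), decreases it).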

7.9 Reliability in item-response theory

In Item-Response Theory, reliability is not constant for a group of respondents, but varies by person. Thus, reliability (as well as standard errors of measurement and confidence intervals) is generally studied as a function of the person’s latent attribute (although group-level reliability measures can also be computed).

8 Validity

8.1 General definition

Validity is generally defined as the extent to which a test measures what it purports to measure.

Test validity can also be seen as the extent to which there is sufficient evidence to interpret the results of a test for its intended purpose (Messick, 1995).

8.2 Content validity

During the test process, we want to ensure that we build items that capture the construct of interest.

For this, we often use Subject Matter Experts (SMEs) to rate items. A typical procedure consists in asking a panel of SMEs to rate items on a 3-point scale (item is not useful / useful but not essential / essential), and to calculate a Content Validity Ratio (CVR) for each item.

In general, CVRs at least above \(0\) and as close to \(1\) as possible are desirable. CVRs can be significance tested, and we may choose to retain items with CVRs that are significantly above \(0\) .
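A common formula for the CVR (Lawshe’s) is \(CVR = \frac{n_e - N/2}{N/2}\), where \(n_e\) is the number of SMEs rating the item “essential” and \(N\) is the total number of SMEs. A minimal sketch with hypothetical panel counts:

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: (n_e - N/2) / (N/2). Ranges from -1 to +1;
    0 means exactly half the panel rated the item 'essential'."""
    half = n_experts / 2
    return (n_essential - half) / half

cvr_good = content_validity_ratio(9, 10)   # .80: most SMEs say "essential"
cvr_split = content_validity_ratio(5, 10)  # 0: the panel is evenly split
```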

8.3 Structural validity

Structural validity (i.e., factorial validity) is the capacity of a test to produce item scores whose structure is consistent with the theoretical structure of the test (e.g., if a test is supposed to be unidimensional, the empirical structure of the test should be unidimensional).

8.3.1 Overview of common structures

Structures are generally named after the number of latent attributes, which are often referred to as factors in this context.

  • Unidimensional structure

    • one factor explains all item scores

  • Multidimensional structure

    • different factors explain different item scores (i.e., each item score is explained by one of the factors)

    • the factors may be correlated or independent

  • Hierarchical structure

    • multidimensional structure, where the factors are not directly correlated but explained by a general (second-order) factor

  • Bifactor structure (Holzinger & Swineford, 1937)

    • each item score is explained by a general factor and one of several specific factors

8.3.2 Correlation matrices

One approach to studying structural validity is to inspect the correlations between items, which is often done using a correlation matrix, where items that are expected to be clustered together (i.e., items that are supposed to be in the same scale/facet) are grouped together.

8.3.3 Exploratory factor analysis (EFA)

Exploratory Factor Analysis is an exploratory method that consists in identifying and interpreting a set of factors that parsimoniously explain the relations between the items.

  • Methods for estimating the factors (extraction methods)

    • Maximum Likelihood Estimation (preferred for continuous/ordinal with more than 4 response categories)
    • Principal Axis Factoring (preferred for binary/categorical/ordinal with 4 or fewer response categories)
    • Principal Components Analysis (outdated for psychometrics investigations)
  • Methods for obtaining a parsimonious number of factors

    • Kaiser’s K1 criterion (removes factors with eigenvalues below 1)

    • Cattell’s scree test (graphical technique that involves sorting factors by eigenvalues, and dropping all factors after a sharp drop in eigenvalues is observed) (Cattell, 1966)

    • Horn’s parallel analysis (consists in simulating factors and retaining factors whose eigenvalues are higher than the simulated factors, or a certain percentile of their distribution) (Horn, 1965)

    • Interpretability (rejects factors that are not clearly interpretable — see below)
  • Rotation methods (facilitate interpretation)

    Rotations are only performed when more than 1 factor is extracted.

    • Orthogonal rotations (e.g., varimax rotation) assume factors to be uncorrelated, should be used if there is a strong rationale for hypothesizing independent factors

    • Oblique rotations (e.g., oblimin rotation) do not assume factors to be uncorrelated, should be used if the relations between factors are presumed unknown or if we hypothesize correlated factors

  • Interpretation

    • Factor loadings represent the strength of the relation between items and factors. They are scaled similarly to a correlation coefficient.

    • In general, we want to look at tables that show loadings by item and factor. If an item has a strong loading (\(>.50\) in absolute value) on a factor, we interpret this as the item being well represented by this factor.

    • By looking at the content of the items and which factor best explains them, we can (in most cases) interpret the factors conceptually (i.e., understand what attribute they correspond to).

    • Items that do not have strong loadings on any retained factor are often discarded.

    • Items that have strong loadings on a factor that they are not expected to be related to (often referred to as “cross-loadings”) are also often discarded.

(Figure: table of factor loadings, where the loadings indicate that factor 1 represents “cog” and factor 2 represents “aff”.)
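Kaiser’s K1 criterion mentioned above can be illustrated with a toy two-item case: the correlation matrix \(\begin{pmatrix}1 & r\\ r & 1\end{pmatrix}\) has eigenvalues \(1 + r\) and \(1 - r\). The correlation value below is hypothetical.

```python
def eigenvalues_2x2_corr(r):
    """Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]]: 1 + r and 1 - r."""
    return sorted([1 + r, 1 - r], reverse=True)

eigenvalues = eigenvalues_2x2_corr(0.6)       # [1.6, 0.4]
retained = [e for e in eigenvalues if e > 1]  # K1: keep eigenvalues above 1
```

With \(r = .6\), the K1 criterion retains a single factor, consistent with two strongly correlated items measuring one attribute.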

8.3.4 Confirmatory factor analysis (CFA)

Confirmatory factor analysis consists in specifying a priori a factor structure (or several alternate structures), and investigating whether it explains well the relations between the items.

CFA is generally performed using Structural Equation Modeling (SEM), which is more extensively covered in later classes.

CFA notably produces:

  • Factor loadings for all items, which are interpreted similarly to those in EFA, and can be used to retain or discard items

  • Fit indices

    • The \(\chi^2\) test should be non-significant (i.e., \(p\) should be larger than \(.05\)).
    • CFI, TLI should be above \(.90\) (the higher the better)
    • RMSEA, SRMR should be below \(.08\) (the lower the better)

8.3.5 Measurement invariance

Factor structures remain general assumptions about how a test works in a population. These assumptions may be more or less appropriate to different persons. The assumption that the same model (with the same parameters) represents accurately the functioning of a test for an entire population is referred to as the assumption of measurement invariance.

Structures and their parameters may vary across individuals (e.g., by a demographic variable), and the study of these variations is generally referred to as the study of measurement invariance (called differential item functioning in Item Response Theory).

8.4 External validity

In general, when instruments are developed, we have expectations regarding what construct is measured and what the scores should be (and should not) related to. External validity is the extent to which an instrument yields scores that are related to other measures in a manner that is conceptually consistent.

The measures that are used for comparison with our instrument are generally referred to as validity criteria.

In general, these comparisons are made using correlation coefficients.

8.4.1 Concurrent validity

In concurrent validity analysis, the instrument and the criterion are administered at the same time.

  • Convergent validity verifies that an instrument and a criterion that should be related are indeed empirically correlated (e.g., the instrument you are building for depression is positively correlated with a measure of emotional exhaustion).

  • Divergent (or discriminant) validity verifies that an instrument and a criterion that should not be related are indeed empirically uncorrelated, or only weakly correlated (e.g., the instrument you are building for depression is not correlated with a measure of openness)

8.4.2 Predictive validity

In predictive validity analysis, the instrument is administered before the criterion. In general, analyses of predictive validity are used when the instrument is used in a prediction context (e.g., to predict health outcomes, educational outcomes).

8.4.3 Diagnostic validity

Diagnostic validity is the ability of a test to accurately categorize individuals into clinical vs. non-clinical groups. In general, we study both the test and its decision threshold (cutoff score).

Diagnostic validity is maximized when we have both high sensitivity and specificity:

  • Sensitivity is the ability of the test to identify correctly those who do have the disorder. (maximizing true positives)
  • Specificity is the ability of the test to identify correctly those who do not have the disorder. (maximizing true negatives)

Types of error:

  • False positive (i.e. type I error): Concluding that an individual has the disorder when the individual does not have the disorder.
  • False negative (i.e. type II error): Concluding that an individual does not have the disorder when the individual has the disorder.
Test result | Disorder                       | No disorder
Positive    | True positive                  | False positive (type I error)
Negative    | False negative (type II error) | True negative

In general, moving the threshold yields a trade-off between sensitivity and specificity (i.e., maximizing sensitivity tends to minimize specificity, and vice-versa).

Depending on the context, we may favor specificity or sensitivity, or aim for a balance.
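The definitions above can be sketched directly from the cells of the classification table (the counts below are hypothetical screening results):

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical results: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives
sens, spec = sensitivity_specificity(40, 10, 5, 45)
```

Raising the cutoff typically converts some true positives into false negatives (lowering sensitivity) while removing false positives (raising specificity), which is the trade-off described above.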

9 Practical use of tests

9.1 Test administration

Test administration procedures notably aim to reduce biases.

We generally consider preparing:

  • The test environment (ventilation, space, noise, disruptions, material, etc.)
  • The test taker (avoid emotional biases, stereotype threat bias, social desirability bias, put effort into engaging interest and cooperation, avoid test sophistication bias by providing explicit instructions)
  • The rater/test scorer (provide clear instructions for rating)

9.2 Test scoring

9.2.1 Norm-referenced vs. criterion referenced tests

Scores are usually interpreted:

  • By comparing the person’s responses with the responses of others (e.g., compare with reference group means, compare using percentile rank).

    • In these cases the test is called a norm-referenced test.
    • Examples: SAT, GRE, WAIS
  • By using a fixed threshold decided in advance (e.g., a person has to succeed 70% of the items to pass).

    • In these cases the test is called a criterion-referenced test.
    • Examples: Driving license examinations, most (non-“curved”) course examinations.

9.2.2 Standard scores

Norm-referenced tests are often interpreted by transforming scores into standard scores. This allows easier interpretation and comparisons across tests.

The most basic format for standard scores is the \(z\) score. \(z\) scores are made so that they have in the reference group a mean of \(0\) and standard deviation of \(1\). Therefore, scores represent directly the distance to the mean (e.g., a \(z\) score of \(−.04\) means that the individual’s score is \(.04\) standard deviations lower than the mean of their reference group).

Once a score has been \(z\)-transformed, it may be attributed a different mean and standard deviation. Common examples include:

  • IQ scores have a mean of \(100\) and standard deviation of \(15\)
  • \(T\) scores have a mean of \(50\) and standard deviation of \(10\)
  • CEEB scores (e.g., SAT) have a mean of \(500\) and standard deviation of \(100\)

They present the same advantages as \(z\) scores, but tend to facilitate communication of results to stakeholders.
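Converting to any of these metrics is a simple linear rescaling of the \(z\) score:

```python
def to_standard(z, mean, sd):
    """Rescale a z score to a standard-score metric (e.g., IQ, T, CEEB)."""
    return mean + z * sd

z = 1.0  # one standard deviation above the reference group mean
iq = to_standard(z, 100, 15)      # 115
t_score = to_standard(z, 50, 10)  # 60
ceeb = to_standard(z, 500, 100)   # 600
```

All three results describe the same relative standing (one standard deviation above the mean), just on different scales.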

9.3 Score uncertainty

Since all psychological measures are imperfect, it is important to measure and communicate the uncertainty about scores.

Common approaches include:

  • Using the standard error of measurement

    • Represents the typical distance between a person’s observed score and their true score
    • For a given test, standard errors get smaller as reliability increases.
  • Using confidence intervals

    • Calculated so that it is estimated that, if the person were measured many times, 95% (or 99%, or 90%) of their scores would fall within the interval.

    • For a given test (and confidence level), confidence intervals are narrower as reliability increases.
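As a sketch, the standard error of measurement is commonly computed as \(SEM = SD\sqrt{1 - r_{xx'}}\), and an approximate 95% confidence interval as \(X \pm 1.96 \times SEM\). The score metric and reliability below are hypothetical.

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def ci95(observed, sd, reliability):
    """Approximate 95% confidence interval around an observed score."""
    margin = 1.96 * sem(sd, reliability)
    return observed - margin, observed + margin

# Hypothetical T-score metric (SD = 10) with reliability .84: SEM = 4
low, high = ci95(60, 10, 0.84)  # approximately (52.2, 67.8)
```

As stated above, a higher reliability shrinks the SEM and thus narrows the interval.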

9.4 Alternate scoring strategies

An alternative to sum/average scoring (and standardizing scores) consists in using latent variable models that can produce estimates of the person attributes.

This procedure is generally referred to as factor scoring (or using factor scores).

Common latent variable models that can produce factor scores:

  • Exploratory factor analysis
  • Confirmatory factor analysis
  • Item-response theory models

The models used generally yield scores that are already standardized (on a \(z\) score scale). In general, estimates of uncertainty (e.g., standard error of measurement) can also be obtained.

Note that the quality of the scores depends on the quality of the model, as well as of the data it was estimated on.

10 Ethics

Note

It is not necessary to review this section for the PSY721 Final Exam.

The ethics of psychological testing for psychologists are described in the Ethical Principles of Psychologists and Code of Conduct (Standard 9) (http://www.apa.org/ethics/code/).

Key principles:

  • Psychologists shall use valid tools that are not outdated.
  • Information about psychometric qualities should be provided by test developers.
  • The choice of tests should be justified, notably by psychometric qualities.
  • Psychologists remain responsible for the quality of their instruments and their interpretation.
  • Psychologists take into account the limitations of test results in their practice and decisions.
  • Psychologists have to obtain informed consent to use tests.
  • Psychologists must inform test takers in understandable language.
  • Unless there is a lawful good reason (e.g., protecting people), test data must be provided to the client.
  • Psychologists must give and explain results, unless it is explained in advance and justified by the purpose of the test.
  • Psychologists should respect the integrity and security of test materials.
  • Psychologists remain responsible for the proper use of tests, even when testing is delegated to others.

11 References

Ayala, R. J. de. (2013). The IRT Tradition and its Applications. Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199934874.013.0008
Brown, W. (1910). Some Experimental Results in the Correlation of Mental Abilities. British Journal of Psychology, 1904-1920, 3(3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276. https://doi.org/10.1207/s15327906mbr0102_10
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1), 41–54. https://doi.org/10.1007/BF02287965
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179–185. https://doi.org/10.1007/BF02289447
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Addison-Wesley.
McDonald, R. P. (1999). Test theory: A unified treatment. Psychology Press. https://doi.org/10.4324/9781410601087
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52(6), 2287–2305. https://doi.org/10.3758/s13428-020-01398-0
Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115(2), 300–307. https://doi.org/10.1037/0033-2909.115.2.300
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. https://doi.org/10.1037/0003-066X.50.9.741
Myszkowski, N. (2024). Item Response Theory for Creativity Measurement. Cambridge University Press. https://doi.org/10.1017/9781009239035
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3(3), 271.